GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Table 1

9 tasks

All tasks are binary classification, except STS-B (regression) and MNLI (three classes).

Single-Sentence Tasks

CoLA

Corpus of Linguistic Acceptability

SST-2

Stanford Sentiment Treebank

Similarity and Paraphrase Tasks

MRPC

Microsoft Research Paraphrase Corpus

STS-B

Semantic Textual Similarity Benchmark

QQP

Quora Question Pairs

Inference Tasks

MNLI

Multi-Genre NLI corpus

NLI stands for natural language inference.

QNLI

SQuAD recast as an NLI task

RTE

Recognizing Textual Entailment

WNLI

Winograd Schema Challenge recast as an NLI task
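
The task list above can be sketched as a small lookup table. This is only an illustration of the notes, not code from the GLUE authors: the name `GLUE_TASKS`, the tuple layout, the category strings, and both helper functions are assumptions introduced here. The task types follow the statement that every task is binary classification except STS-B (regression) and MNLI (three classes).

```python
# Hypothetical summary of Table 1 as recorded in these notes.
# name -> (category, task type, number of classes; None for regression)
GLUE_TASKS = {
    "CoLA":  ("single-sentence", "classification", 2),
    "SST-2": ("single-sentence", "classification", 2),
    "MRPC":  ("similarity/paraphrase", "classification", 2),
    "STS-B": ("similarity/paraphrase", "regression", None),
    "QQP":   ("similarity/paraphrase", "classification", 2),
    "MNLI":  ("inference", "classification", 3),
    "QNLI":  ("inference", "classification", 2),
    "RTE":   ("inference", "classification", 2),
    "WNLI":  ("inference", "classification", 2),
}


def tasks_by_category(tasks):
    """Group task names under their category, preserving table order."""
    grouped = {}
    for name, (category, _kind, _n_classes) in tasks.items():
        grouped.setdefault(category, []).append(name)
    return grouped


def output_size(name):
    """Width of a prediction head: 1 for regression, else the class count."""
    _category, kind, n_classes = GLUE_TASKS[name]
    return 1 if kind == "regression" else n_classes


if __name__ == "__main__":
    for category, names in tasks_by_category(GLUE_TASKS).items():
        print(category, names)
```

A lookup like this makes it easy to pick the right output layer per task, for example a single scalar for STS-B and three logits for MNLI.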

We evaluate baselines that use ELMo [...] as well as state-of-the-art sentence representation models.